Error Bounds for Approximate Value Iteration
Abstract
Approximate Value Iteration (AVI) is a method for solving a Markov Decision Problem by making successive calls to a supervised learning (SL) algorithm. A sequence of value representations V_n is generated iteratively by V_{n+1} = A T V_n, where T is the Bellman operator and A an approximation operator. Bounds on the error between the performance of the policies induced by the algorithm and the optimal policy are given as a function of weighted L_p-norms (p ≥ 1) of the approximation errors. The results extend the usual analysis in L_∞-norm and relate the performance of AVI to the approximation power (usually expressed in L_p-norm, for p = 1 or 2) of the SL algorithm. We illustrate the tightness of these bounds on an optimal replacement problem.

Introduction

We study the resolution of Markov Decision Processes (MDPs) (Puterman 1994) using approximate value function representations V_n. The Approximate Value Iteration (AVI) algorithm is defined by the iteration

V_{n+1} = A T V_n,     (1)

where T is the Bellman operator and A is an approximation operator or a supervised learning (SL) algorithm. AVI is very popular and has been successfully implemented in many different settings in Dynamic Programming (DP) (Bertsekas & Tsitsiklis 1996) and Reinforcement Learning (RL) (Sutton & Barto 1998). A simple version is: at stage n, select a sample of states (x_k)_{k=1...K} from some distribution μ, compute the backed-up values v_k := T V_n(x_k), then make a call to a SL algorithm, which returns a function V_{n+1} minimizing some average empirical loss

V_{n+1} = argmin_f (1/K) ∑_k l(f(x_k) − v_k).

Most SL algorithms use squared (L_2) or absolute (L_1) loss functions (or variants) and thus solve a minimization problem in weighted L_1- or L_2-norm, where the weights are defined by μ. It is therefore crucial to estimate the performance of AVI as a function of the weighted L_p-norms (p ≥ 1) used by the SL algorithm. The goal of this paper is to extend the usual results in L_∞-norm to similar results in weighted L_p-norms. The performance achieved by such a resolution of the MDP may then be directly related to the approximation power of the SL algorithm. Alternative results in approximate DP with weighted norms include Linear Programming (de Farias & Roy 2003) and Policy Iteration (Munos 2003).

Let X be the state space, assumed to be finite with N states (although the results given in this paper extend easily to continuous spaces), and A a finite action space. Let p(x, a, y) be the probability that the next state is y given that the current state is x and the action is a. Let r(x, a, y) be the reward received when a transition (x, a) → y occurs. A policy π is a mapping from X to A. We write P^π for the N × N matrix with elements P^π(x, y) := p(x, π(x), y) and r^π for the vector with components r^π(x) := ∑_y p(x, π(x), y) r(x, π(x), y).
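The AVI iteration described above is concrete enough to sketch in code. The following Python snippet is a minimal illustration, not the paper's implementation: it assumes a small finite MDP with a randomly generated transition kernel P and rewards R, a discount factor gamma, and a polynomial least-squares fit standing in for the approximation operator A (the SL algorithm). States are sampled from μ, the backed-up values v_k := T V_n(x_k) are computed, and a function is fitted with a squared loss, i.e. the weighted L_2 setting discussed in the text. All names and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Illustrative MDP (sizes and kernel are assumptions, not from the paper) ---
N, NA = 50, 3            # number of states, number of actions
gamma = 0.95             # discount factor (assumed; discounted-MDP setting)

P = rng.random((N, NA, N))            # transition kernel p(x, a, y)
P /= P.sum(axis=2, keepdims=True)     # normalize over next states y
R = rng.random((N, NA, N))            # rewards r(x, a, y)

def bellman_backup(V, x):
    """(T V)(x) = max_a sum_y p(x,a,y) * (r(x,a,y) + gamma * V(y))."""
    return max(P[x, a] @ (R[x, a] + gamma * V) for a in range(NA))

def features(xs):
    """Low-degree polynomial features of the normalized state index,
    playing the role of the function class used by the SL algorithm."""
    z = np.asarray(xs) / (N - 1)
    return np.vstack([z**d for d in range(5)]).T

def avi_iteration(V, mu, K=200):
    """One AVI step V_{n+1} = A T V_n: sample K states from mu, compute the
    backed-up values, and fit them by least squares (squared empirical loss)."""
    xs = rng.choice(N, size=K, p=mu)                   # x_k ~ mu
    vs = np.array([bellman_backup(V, x) for x in xs])  # v_k := T V_n(x_k)
    theta, *_ = np.linalg.lstsq(features(xs), vs, rcond=None)
    return features(np.arange(N)) @ theta              # V_{n+1} on all of X

mu = np.full(N, 1.0 / N)    # sampling distribution over states
V = np.zeros(N)
for n in range(50):
    V = avi_iteration(V, mu)
```

Since the states are drawn from μ, the unweighted least-squares fit on the sample corresponds to minimizing the μ-weighted L_2 error; swapping the squared loss for an absolute loss would give the weighted L_1 case mentioned above.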
Similar papers
Error Bounds for Approximate Policy Iteration
In Dynamic Programming, convergence of algorithms such as Value Iteration or Policy Iteration results, in discounted problems, from a contraction property of the back-up operator, guaranteeing convergence to its fixed point. When approximation is considered, known results in Approximate Policy Iteration provide bounds on the closeness to optimality of the approximate value function obtained by suc...
Approximate Policy Iteration for Markov Decision Processes via Quantitative Adaptive Aggregations
We consider the problem of finding an optimal policy in a Markov decision process that maximises the expected discounted sum of rewards over an infinite time horizon. Since the explicit iterative dynamic programming scheme does not scale when increasing the dimension of the state space, a number of approximate methods have been developed. These are typically based on value or policy iteration...
Performance bounds for λ policy iteration and application to the game of Tetris
We consider the discrete-time infinite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), that introduced λ policy iteration—a family of algorithms parametrized by a parameter λ—that generalizes the standard algorithms value and policy iteration, and has some deep connection...
Robust Value Function Approximation Using Bilinear Programming
Existing value function approximation methods have been successfully used in many applications, but they often lack useful a priori error bounds. We propose approximate bilinear programming, a new formulation of value function approximation that provides strong a priori guarantees. In particular, this approach provably finds an approximate value function that minimizes the Bellman residual. Sol...
Continuous State Dynamic Programming via Nonexpansive Approximation
This paper studies fitted value iteration for continuous state numerical dynamic programming using nonexpansive function approximators. A number of approximation schemes are discussed. The main contribution is to provide error bounds for approximate optimal policies generated by the value iteration algorithm. Journal of Economic Literature Classifications: C61, C63